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AN  IMPLEMENTATION  OF 
THE  SINGULAR  VALUE  DECOMPOSITION 
ON  THE  CONNECTION  MACHINE  CM-2 

1.  INTRODUCTION 

In  recent  years,  the  singular  value  decomposition  (SVD)  has  become  an  important  tool  for 
modern  digital  signal  processing  to  find  higher  resolution  and  more  accurate  algorithms  to  extract 
underlying  signal  and  system  parameters  fiom  measurements.  The  SVD  implemented  in  the  LIN- 
PACK  [l]  scientific  library  was  designed  for  a  serial  or  vector  machine  and  is  not  directly  portable 
to  the  Connection  Machine,  which  is  of  data  parallel  architecture.  A  parallel  version  of  the  SVD 
is  explored  here. 

In  Section  2,  the  definition  and  important  properties  of  the  SVD  are  briefly  stated.  Section 
3  reviews  previous  implementations  of  the  SVD.  Section  4  describes  the  implementation  of  the 
algorithm  on  the  Connection  Machine.  Details  of  the  algorithm  and  results  are  also  given. 


2.  THE  SINGULAR  VALUE  DECOMPOSITION 


If  A  is  a  m  x  n  matrix  of  rank  r  then  there  exist  real  orthogonal  matrices  U  =  [uiU2-..um]  and 
V  =  [vxv2...vn]  such  that  A  =  UEV*  where 


£  =  U'AV 


diag(o\,oi,...oT)  0" 

0  0. 


r  <  min(m,n)  and  er,  >  <r,+1  >  0  for  i  -  l,..,r  -  1.  The  <j{  are  the  singular  values  of  A  and  the 
corresponding  vectors  u,  and  v*  are  respectively  the  ith  left  and  right  singular  vectors. 


The  most  valuable  aspects  of  the  SVD  for  digital  signal  processing  are  in  the  rank  and  the  dyadic 
decomposition  properties.  The  rank  property  says  that  the  singular  values  can  be  considered  as 
quantitative  measures  of  the  inexact  arithmetic  measures  of  the  exact  mathematic  notion  of  rank. 
The  dyadic  decomposition  describes  a  matrix  as  the  sum  of  r  rank-one  matrices  of  decreasing 
importance,  as  measured  by  the  singular  values: 


Rank  property:  rank(A)  =  r  where  o\  >  oi  >  ...  >  aT  >  0 

r 

Dyadic  decomposition:  A  =  £ 

«— i 
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With  these  properties,  the  application  of  the  SVD  to  signal  processing  and  to  a  wide  variety  of 
other  systems  is  often  where  a  linear  model  is  constructed  from  a  sequence  of  observed  data  vectors. 
The  complexity  of  the  system  is  reflected  by  the  rank  of  the  data  matrix,  and  the  parameters  of 
the  model  may  often  be  extracted  from  certain  subspace  spanned  by  singular  vectors. 

These  techniques  and  their  applications  to  many  problems  are  reviewed  in  Refs.  2  and  3.  For 
example,  the  linear  least  squares  method  uses  the  SVD  to  find  a  vector  of  model  parameter  x  such 
that  the  system  output  A  x  is  as  close  to  the  actual  observed  output  b  as  possible: 


x  =  A+b, 


the  pseudo-inverse  n  x  m  matrix  A+  =  VE+U‘,  with  E+  = 

The  SVD  and  the  Generalized  SVD  [4]  serve  as  the  basis  for  ESPRIT  [5],  a  technique  devel¬ 
oped  by  Roy  and  Kailath  for  applications  such  as  direction-of-arrival  (DO A)  estimation  in  which 
estimates  of  the  spatial  location  of  multiple  sources  whose  radiation  is  received  by  an  array  of 
sensors  are  sought.  While  somewhat  less  general  than  the  well  known  MUSIC[6]  method.  ES¬ 
PRIT  should  prove  to  be  more  practical  because  it  does  not  rely  on  complete  knowledge  of  the 
antenna  gain  patterns,  and  it  vastly  reduces  the  amount  of  calculations.  In  an  example  given  in 
Ref.  5,  a  factor  of  105  reduction  of  number  of  multiplications  over  MUSIC  was  estimated  for 
a  twenty-element  sensor  array  employed  to  cover  10  signals  in  an  aperture  of  2  radians  in  both 
elevation  and  azimuth,  with  one  milliradian  resolution. 


E-1  0 

0  0 


3.  PREVIOUS  WORK 

Theoretically,  the  SVD  may  be  performed  directly  following  the  observation  that  the  singular 
values  <7,  are  simply  the  nonnegative  roots  of  the  largest  eigenvalues  of  the  matrix  AAf,  and  the 
singular  vectors  u,  and  v,  are  the  corresponding  eigenvectors  of  AA*  and  A*A.  In  practice,  the 
loss  of  numerical  precision  becomes  so  severe  that  smaller  singular  values  are  rendered  incorrect 

[7]- 


The  most  widely  used  algorithms  used  on  serial  machines  are  variants  of  those  proposed  by 
Golub  and  Reinsch  [8]  and  Golub  and  Kalan  [9],  in  which  the  given  matrix  is  bidiagonalized,  then 
the  QR  method  is  used  to  compute  the  singular  values  of  the  resultant  bidiagonal  form.  This 
method  is  inherently  unsuitable  for  parallel  processing  [10,11]. 

The  one-sided  Jacobi  method  credited  to  Hestenes  [12,13]  and  the  two-sided  variant  [13,14] 
that  were  superseded  by  the  Golub  serial  algorithms  are  apparently  suitable  to  parallel  process¬ 
ing  because  all  row-pairs  in  the  matrix  may  be  processed  concurrently  and  each  element  of  each 
row  may  also  be  operated  on  during  the  rotations.  In  the  Jacobi  iteration  process  the  pair-wise 
rotations  must  be  done  in  a  particular  order  for  the  process  to  converge.  The  standard  cyclic- 
by-rows  method  for  Jacobi  iteration  []5],  wfe-h  involves  the  sequential  nioces^irg  of  row- pairs 
(l,2j,(i,3),..,(l,m),(2,3),(2,4),...,(2,m),...,(m-  l,m)  is  not  suitable  for  concurrent  processing 
because  of  the  obvious  data  dependency.  Many  other  methods  to  process  row-pairs  concurrently 
are  reviewed  in  Ref.  1G.  The  permutation  scheme  described  in  this  report  is  akin  to  the  bubble-sort 
algorithm  [17]  in  which  each  neighboring  pair  is  transformed  by  a  rotation  that  leaves  the  larger 
(in  the  norm  sense)  row  on  top. 
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Ewerbring  et  al.  [11]  implemented  a  similar  algorithm  on  the  Connection  Machine  using  a  par¬ 
allel  variant  of  Lisp.  Their  report  did  not  state  the  execution  time.  The  implementation  described 
here  is  in  Fortran,  maps  more  matrix  elements  to  a  processor  and  uses  a  different  permutation 
scheme. 


4.  IMPLEMENTATION  OF  THE  SVD  ON  THE  CM-2 

The  massively  parallel  computer  CM-2  on  which  the  code  runs  is  described  in  Section  4.1.  In 
Section  4.2,  formulae  to  generate  the  rotation  matrix  and  the  permutation  scheme  are  described 
in  detail.  Results  are  discussed  in  Section  4.3.  The  Fortran  code  is  included  in  the  Appendices. 


4.1.  Connection  Machine  CM-2 


Initially,  the  Connection  Machine  machine  model  was  a  single  instruction  multiple  data  (SIMD) 
array  of  up  to  64K  (K=1024)  bit-serial  processors  connected  by  a  hypercube  bit-serial  interconnect 
network.  This  paradigm  is  natural  and  useful  in  a  number  of  applications,  such  as  the  method  of 
discrete  simulation  of  fluid  flow  [18]  in  which  each  processor  is  mapped  onto  a  “cell”  of  the  fluid 
body  which  interfaces  only  with  a  number  of  neighbor  cells. 

In  the  second  generation  CM  machine  [19],  a  32-bit  or  64-bit  Weitek  floating  point  arithmetic 
unit  (FPU)  was  added  to  each  group  of  32  processors  to  provide  fast  single  or  double  precision 
floating  point  capability. 

The  virtual  processor  concept  allows  automatic  mapping  of  problems  that  require  more  nodes 
than  are  available  in  the  physical  machine.  In  this  virtual  processor  mode,  every  instruction  is 
executed  n  times,  where  the  vp  ratio  n  is  the  ratio  of  number  of  problem- domain  nodes  to  the 
actual  number  of  processors.  The  problem  size  is  thus  limited  by  the  amount  of  memory  in 
each  physical  processor.  At  the  Naval  Research  Laboratory  Connection  Machine  Facility,  the  16K 
processor  double  precision  CM-2  has  1  Megabit  of  memory  per  processor. 

The  core  of  the  machine  operation  is  in  downloadable  microcode.  User  programming  languages 
include  an  assembly  language  called  Paris  and  parallel  versions  of  other  common  high-level  lan¬ 
guages  (HLL ):*Lisp,  C*and  CM  Fortran.  CM  Fortran  is  based  on  Fortran-8X,  which  is  similar  to 
Fortran-77,  augmented  with  array  operations. 

The  recently  introduced  slicewise  Fortran  compiler  used  for  this  work  employs  a  different  ma¬ 
chine  model.  The  machine  is  presented  to  the  compiler  as  up  to  2048  depth-4  pipelined  floating 
point  nodes;  each  node  is  a  32-bit  or  64-bit  processor.  For  certain  problems,  this  machine  model 
produces  compiled  codes  that  are  two  to  three  times  faster  than  the  fieldwise  modeled  compiler  by 
streamlining  of  data  in  and  out  of  the  FPUs,  and  by  using  in-place  FPU  calculations. 

The  theoretical  single  precision,  peak  floating-point  performance  of  a  full  (64K)  CM-2  is  27 
GFLOP^  turning  that  all  of  the  floating  point  chips  in  the  machine  perform  a  multiplication  and 
an  addition  every  clock  cycle.  On  a  full  CM-2  with  32-bit  FPU  and  microcode  version  5.0,  Levit  [20] 
reported  a  much  lower  peak  performance  of  users  code  without  interprocessor  communication.  This 
so-called  memory-bandwith-bound  peak  performance  is  cited  to  be  5.17  GFLOPS.  For  a  16K  64-bit 
FPU,  roughly  800  MFLOPS  is  expected  for  the  communication-free  portions  of  the  code.  Grid 
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communication  (between  power-of-two  interprocessor  points)  io  only  73  MIPS  (in  terms  of  number 
of  32-bit  words  communicated  per  second)  at  vp  ratio  1,  to  1375  MIPS  for  adjacent  communication 
at  a  high  vp  ratio.  The  fast  Fourier  transform  (FFT)  has  been  coded  and  is  reported  to  execute 
at  a  sustained  rate  of  more  than  1  GFLOPS  for  very  large  FFTs  on  a  64K  CM-2  [21]. 


4.2.  SVD  Algorithm 


Consider  mapping  each  element  of  a  real  matrix  A  of  size  m  X  n  onto  a  node  on  a  2-D  array 
of  virtual  processors  on  the  CM-2.  Transformations  of  the  matrix  that  require  change  to  the  value 
of  each  element  may  take  place  on  all  processors  simultaneously.  The  Hestenes  one-sided  Jacobi 
iteration  algorithm  exploits  this  concurrency. 


One-Sided  Jacobi  Rotation 

m  ^  ^ 

Denote  a  matrix  A  in  as  A22  to  emphasize  that  2-element  matrix  operations  are  to 

be  performed  on  the  y  pairs  of  the  n-element  rows. 

In  Hestenes’s  construction  of  the  SVD,  two  rows  of  the  matrix  are  rotated  to  be  orthogonal 
then  permuted  with  other  rows  to  continue  the  process  until  all  are  mutually  orthogonal.  This  is 
achieved  by  multiplying  each  pair  of  elements  from  the  row  pairs  by  a  sequence  of  rotations  {Rfc}, 

—  X  n  —xn 

R  =  Rj*x  2  •  The  rotated  result  is  stored  in  a  matrix  H  =  H22 


The  product  of  the  rotation  matrices  is  constructed  by  applying  {Rjt}  to  an  initial  identity 
matrix  I  =  I2  Xm  during  the  iterations.  The  result  is  kept  in  matrix  U*. 


Xn 


where 


MJxm=ni^] 


xm  ^  xm 
2X2  *2 


(i) 


Note  that  in  Eq.(  1),  the  rotation  matrices  R^  are  replicated  m  times  in  each  row  to  match  the 
dimensions  of  I. 

After  normalizing  each  row  i  of  H  by  its  norm  H  =  [hih2...h,...hm]1  may  be  factored  into 
as  a  product  of  a  pseudo-diagonal  matrix  (a  diagonal  matrix  concatenated  with  null  rows)  £  and 
an  orthonormal  matrix  V*. 


H  =  [h1h2...h,...hr...hm]t  =  £V(, 

where 

I  hi  I  =  <7!  >  |h2|  =  o2...  >  | hr  |  =  <rr  >  0; 
V*  =  [v  j  v2...V|...vm]*, 

where 

_  f  h,/cq,  i  =  1  ,...,r 
V‘  \  0,  i  >  r 


(2) 
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diag(oua2,...(7r)  0 


In  Eqs.  (1)  and  (2),  since  Ut  and  V  are  orthonormal  and  E  is  pseudo-diagonal,  we  have  the 
defining  equation  for  the  SVD: 

EV*  =  U*A.  (3) 

Two  rows  x  and  y  are  to  be  rotated  into  rows  X  and  Y,  respectively  by  using  the  premultiplier 
rotation  matrix  R  : 

(?)-*(;:) £5) CO-  «> 

The  first  criterion  for  selecting  a  value  for  the  rotation  angle  0  is  for  the  resultant  rows  to  be 
orthogonal: 

X‘Y  =  0.  (5) 

Defining 

a  =  x*y 

£  =  M2  -  lyl2  (6) 

7  =  (a2  +  /?2)5, 

and  substituting  the  expansion  of  the  right-hand  side  of  Eq.  (  4)  into  Eq.  (5),  we  have 

tan  20  = 

cos  2$  =.  (7) 

sin  20  ~  =  2  sin  0  cos  0. 

The  ±  sign  ambiguity  corresponds  to  the  £  ambiguity  of  20  which  can  be  resolved  by  an 
additional  constraint  that  the  norms  of  the  rows  become  more  orderly  through  each  rotation  in 
order  for  the  rotation  process  to  converge.  In  fact,  it  will  soon  be  shown  that  the  +  sign  for  cos  20 
and  sin  20  results  in  a  rotation  that  puts  the  larger  norm  on  top  while  the  -  sign  results  in  the 
smaller  norm  on  top. 

The  rotation  matrix  coefficients  may  then  be  derived  from  the  above  using  the  half-angle 
trigonometry  identities.  Thus: 

cos<?  =  ±(i±£§ 2*1)7  =  ±(a±£)* 

sin  0  =  j 

An  arbitrary  limitation  of  |0|  <  ^  has  been  found  to  help  in  the  convergence  [22].  In  Eq.(  8), 
this  limitation  is  imposed  by  selecting  the  positive  value  for  cos0.  The  sign  of  sin#  is  the  same  as 
that  of  sin  20  which  is  determined  by  the  sign  of  a. 

To  see  the  significance  of  the  sign  for  cos  20  and  sin  20,  calculate  the  change  in  norms  of  the 
row,  say  x: 


X‘X  -  x‘x  =  ia  sin  20  -  0  sin2  0 
-  +ai  _  -  ±2- 

—  X  2-y  r  2-r  ~ 


5 


N.  A.  CHU 


If  a  +  sign  is  chosen  in  place  of  the  ±  in  the  Eq.  (  9),  the  change  in  norm  becomes 

X'X-x‘x  =  ^A  (10) 

£ 

in  which  case,  |x|  has  increased  since  the  right-hand  side  of  Eq.  (10)  is  greater  than  zero  (The 
case  of  7  =  /3  is  considered  separately  as  discussed  below).  A  similar  proof  may  be  carried  out  for 
y'y-Y'Y. 

X<X-x'x  =  y‘y-Y'Y  =  l^>0.  (11) 

If  a  -  minus  is  chosen,  Eq.(  10)  becomes  —  which  is  <  0. 


Numerical  Issues 


Equations  (8)  should  be  used  carefully.  Specifically,  avoid  the  case  when  (7  ±  /?)  requires  a 
subtraction  that  results  in  a  loss  of  accuracy.  An  improved  algorithm  to  construct  the  rotation 
matrix  R  is: 


If  /?  >  0,  calculate  cos 0  =  (-j^)5 

If  j3  <  0,  calculate  sin  9  =  sign(a)(:i^~)i 


then  calculate 
then  calculate 


sin  6  = 
cos  6  = 


2*7  cos  9  1 

Of 

2 'y  sin  6  * 


(12) 


On  a  digital  computer,  the  orthogonality  condition  in  Eq.  (5)  can  be  satisfied  to  within  a 
quantity  equivalent  to  the  norm  of  a  null  row.  The  orthogonality  condition  (based  on  Ref.  22)  is: 


where 


x‘y  <  6  min(|xj,  jy|) 


6  = 


e|A| 

(  m  n 


=  <  I  E  £  4 

V.=u=i  , 


(13) 


(14) 


The  single  precision  (32-bit)  and  double  precision  (64-bit)  floating-point  machine  precisions  are 
1.17e  -  7  and  2.22e  -  16,  respectively. 

When  the  norm  of  either  vector  becomes  less  than  6,  the  rotation  becomes  meaningless  and 
could  be  avoided.  On  a  conventional  machine,  avoiding  these  calculations  may  reduce  computation 
time.  On  the  CM-2,  however,  no  saving  is  expected  because  the  entire  array  has  to  be  operated 
on.  Taking  note  of  the  occurences  of  null  norm  and  of  orthogonal  row-pairs ,  however,  serves  to 
establish  the  stopping  criterion  of  the  iteration,  namely: 

Stop  when  for  all  row  pairs  (x,,y,), i  =  l,..,y, 

|xi|  <  6  or  |y,|  <  6  or  (xjyi)  <  6  min(|x,|,  |y,|).  (15) 


The  calculation  of  the  norms  of  the  rows  in  Eq.  (6)  is  expensive  in  terms  of  execution  time  on 
the  CM-2  because  of  the  interprocessor  communication  involved  in  adding  the  square  of  the  row 
elements,  each  assigned  to  a  virtual  processor.  Alternatively,  new  values  may  be  computed  from 
the  elements  of  the  rotation  matrix  R  and  the  current  values  of  the  square-  norms.  The  loss  of 
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Fig.  1  —  Permutation  scheme  to  visit  all  row  pairs  in  a  sweep,  (a)  The  general  structure  is  similar  to  a  bubble- 
sort  using  neighbor  exchanges.  Rotation  followed  by  exchange  achieves  the  required  sorting  effect  that  makes  the 
permutation  scheme  converges  to  a  bubble-sort,  (b)  The  rotation  tends  to  put  the  larger  (norm)  row  on  top,  thus  in 
an  odd  sweep,  an  exchange  is  required  after  rotation  to  permute  the  rows,  c)  In  an  even  sweep,  the  inputs  must  first 
be  exchanged  so  that  the  order  is  enforced  after  the  subsequent  rotation. 


accuracy  in  this  calculation  is  sufficiently  small  for  the  algorithm  to  converge — care  must  be  taken 
to  avoid  calculating  the  norm  since  it  would  mean  taking  the  square-root  of  a  negative  reed  number. 
Experiments  with  the  code  showed  that  there  was  no  convergence  penalty  in  terms  of  number  of 
sweeps. 

Permutation  Scheme 

On  the  Connection  Machine,  high-speed  algorithms  must  be  designed  with  special  care  in  the 
assignment  of  variables  that  reside  on  different  processors.  A  general  assignment  takes  on  the  order 
of  a  millisecond  to  send  data  between  arbitrary  processors,  while  an  assignment  using  specialized 
communication  calls,  such  as  eshift  to  exenange  data  in  a  systolic  manner,  is  an  order  of  magnitude 
faster.  Special  hardware  is  used  in  the  high  speed  execution  of  a  set  of  specialized  communication 
utilities  that  includes  scan,  global,  reduce,  spread  and  multispread  to  implement  the  broadcast 
and/or  accumulation  of  values  to/from  n  processors. 

The  desirability  of  the  nearest-neighbor  systolic  communication  and  the  mesh  layout  of  the 
matrix  leads  naturally  to  a  pairing  scheme  as  illustrated  in  Fig.  1  (a).  For  the  purpose  of  illustration, 
each  of  the  m  rows  of  the  matrix  is  indicated  by  an  index  (1, ...,  m) — m  even.  Suppose  for  a  moment 
that  the  rows  are  placed  in  descending  order  (from  left  to  right)  according  to  the  vector  norm,  i.e. 
|xx  |  >  |x2 1  >  ...  >  jxm | .  The  rows  are  then  exchanged  pairwise:  (1, 2),  (3, 4),  (5, 6), ...,  (m  -  3,m  - 
2),  (m  -  1,  m)to  become  (2, 1),  (4, 3), (6, 5), ...,  (m  -  2,  m  -  3 ),(m,m  -  1).  In  the  next  iteration,  the 
process  is  repeated  without  the  first  and  m-th  rows:  (2),(l,4),(3,6),...,(m  -  3,m).  The  iteration 
repeats  even-odd  for  a  total  of  m  cycles  (a  sweep)  after  which  all  pairing  of  the  rows  (1  ,...,m) 
have  been  visited  and  the  norm  ordering  is  reversed,  i.e.  m,  m  -  1, ...,  2, 1. 

If  the  row  norms  are  not  ordered,  the  same  sorting  effect  described  above  can  also  be  achieved 
if  each  pair  is  exchanged  conditionally  on  a  particular  ordering.  When  the  conditional  pairwise 


n.  a.  cm; 


exchange  is  followed  by  neighbor  swap,  we  have  a  permutation  identical  to  that  in  th  familiar 
bubble  sort  algorithm. 

The  rotation  with  matrix  R  (Eq.  (4))  also  converges  into  a  bubble-sort  transformation  because 
of  the  ordered-norm  conditions  described  in  Eq.  (11).  For  each  row-pair  (x,y),  as  |x|  keeps  in¬ 
creasing  and  |y|  keeps  decreasing,  eventually  |x|  >  |y|.  Experience  with  t'  is  SVD  algorithm  shows 
that  this  ordered  state  is  usually  achieved  in  the  first  couple  of  sweeps. 

Fig.  1(b)  illustrates  this  concept;  in  each  of  the  m  stages  of  an  odd-numbered  sweep,  each 
row-pair  is  shown  to  feed  into  an  oval  icon  representing  the  premultiplication  by  matrix  R.  After 
the  rotation,  the  results  are  interchanged  to  realize  the  bubble-sort. 

In  Fig.  1(c)  which  illustrates  an  even-numbered  sweep,  each  input  row-pair  is  unconditionally 
exchanged  before  entering  the  rotation  icon.  The  necessity  of  this  step  is  clear  if  one  keeps  in 
mind  that  the  rotation  tends  (over  a  few  iterations)  to  make  the  norm  of  the  upper  output  larger 
than  that  of  the  lower.  Upon  entering  an  even-numbered  sweep,  if  this  exchange  is  not  male 
to  put  the  larger  norm  row  on  top,  the  subsequent  rotation  would  effectively  undo  the  rotation 
of  the  preceding  odd-numb  jred  sweep.  In  the  even-numbered  sweeps,  no  exchange  is  required  at 
the  output  because  the  output  norms  ordering  is  to  be  reversed  from  the  -'  der  produced  by  the 
previous  rotation  (larger  norm  on  top). 

An  alternative  approach  is  to  use  a  different  set  of  values  for  R  such  that  the  rotation  will  result 
in  a  smaller  norm  on  top.  This  requires  a  change  in  Eqs.  (12)  that  involves  reversing  the  sign  of 
3  and  the  order  of  calculating  cos  6  and  sinfl.  The  simple  mapping  of  one  floating-point  processor 
node  per  matrix  element  actually  uses  only  half  the  resources  for  computation  because  the  rotation 
of  each  row-pair  is  identical  for  each  of  the  elements  of  the  pair.  The  solution  used  here  is  to  assign 
a  row-pair  to  one  processor-row  to  make  full  use  of  all  processor  nodes  for  actual  calculations. 
More  significantly,  the  number  of  interprocessor  communication  steps  is  reduced  proportionally; 
this  should  significantly  reduce  execution  time.  In  fact,  experiments  showed  this  improved  mapping 
reduced  the  execution  time  by  an  order  of  magnitude. 

Matrix  Shape. 

As  indicated  at  the  beginning  of  Section  4.2.  if  the  matrix  A  is  m  X  n,  then  U  and  V  are 
rn  x  m  and  n  x  n,  respectively.  When  m  ^  n ,  the  constructive  algorithm  described  above  becomes 
somewhat  cumbersome  for  the  2-D  layout  on  the  CM-2.  If  m  is  slightly  more  or  less  than  n,  the 
matrix  A  may  be  simply  padded  with  |m  -  n\  null  rows  or  columns.  However,  if  m  >  n,  the 
algorithm  will  have  to  be  modified  t'>  avoid  working  directly  with  the  large  m  X  m  matrix  U.  In 
this  case.  Uf  may  be  internally  processed  sequentially  as  y  matrices,  each  of  size  m  x  n.  See 
Appendix  B  for  detail. 


4.3.  Results 

A  CM- For* ran  subroutine  was  written  accurding  to  the  algorithm  and  requirements  presented 
in  the  preceding  section.  Appendix  A  contains  the  source  codes  for  the  subroutine  that  is  specific 
for  the  case  of  m  ~  n.  Appendix  B  contains  a  modified  version  of  Appendix  A  for  the  general  case 
of  m  >  n.  Appendix  C  contains  source  codes  for  a  test  driver.  The  codes  were  tested  on  random 
real  matrices.  The  Connection  Machine  used  was  a  1CK  CM-2  with  64-bit  FPU. 
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Table  1  —  Coinp^'ing  CM-2  execution  times  with  UNPACK.  The  CM  Code  was  compiled  by  a  slicewise  Fortran 
compile-.  The  LIN  PACK  codes  were  run  on  a  very  lighltly  loaded  Sun-4/280  and  one  processor  of  a  Convex  C21G. 
The  normalized  residual  error  after  12  sweeps  was  on  the  order  of  le  —  16. 


Double  Precision 

#  processors 

exec  time 

#  sweeps 

512 

|  :  1 1 1 

280  sec 

msm 

256 

39  sec 

mri 

512 

Convex 

1 

141  sec 

256 

Convex 

1 

21  sec 

256 

Sun-u 

1 

305  sec 

In  Table  1,  the  execution  times  for  m  =  n  axe  compared  against  the  execution  times  of  the 
LIN  PACK  dsvdc  codes  on  one  processor  of  a  Convex  C210  and  a  Sun/4-?80.  Both  the  Convex  and 
Sun/4-280  codes  were  compiled  by  using  the  respective  Fortran  compiler  with  optimization;  loads 
were  minimal.  The  codes  for  the  general  cases  where  m  >  n  were  slightly  slower. 

Table  1  shows  that  the  LIN  PACK  implementation  on  one  procressor  of  a  Convex  C210  is  2 
times  faster  than  the  CM- Fortran  implementation  on  the  16K  CM-2  at  m  =  n  =  256  and  512.  A 
full  (64K)  CM-2  is  expected  to  run  between  3  to  4  times  faster  than  a  16K  CM-2.  It  is  rea.  onable 
to  conclude  that  the  CM-2  and  Convex  implementations  are  comparable  in  execution  times. 

For  a  great  majority  of  runs  on  random  matrices,  the  number  of  row  rotations  begins  to  drop 
off  at  the  end  of  8  sweeps,  and  down  to  0  by  the  end  of  12  sweeps.  Accuracy  was  checked  by 
computing 

Error  =  max  I  A.  —  USV(|.  (16) 

Krrors  in  the  CM-2  runs  were  on  the  order  of  le-14  for  double  precision  and  le-5  for  single  precision. 
After  normalizing  by  the  F-norm  of  A,  these  errors  were  on  the  order  of  the  respective  machine 
precisions.  To  gauge  the  efficiency  in  the  usage  of  main  hardware  components  (the  FPUs),  the 
number  of  floating  point  operations  in  the  innerloop  of  the  Fortran  code  (subroutine  svdcore  in 
Appendix  A)  was  counted.  By  using  a  weight  of  4,  2,  and  1  for  square-  root,  division,  and 
multiplication/addition/logical  ■'-'■spectively,  the  FLOP  count  Q  —  100  per  virtual  processor  per 
loop  per  sweep.  This  included  20  for  the  calculation  of  /3,  7  and  the  conditions  for  rotation  (Eq.  (6) 
and  (15)),  40  each  for  (3  >  0  and  (3  <  0  for  the  calculation  of  cos0,  sin  9  and  the  subsequent  rotations 
according  to  Fqs  (4)  and  (12).  (On  the  CM-2,  either-or  code  segments  are  sequentially  executed 
and  thus  must  be  counted  towards  the  FPU  usage.) 

771 

Qflop  =  100  (—n)ml„  (17) 

where  I,  is  the  number  of  sweeps.  By  using  Eq.  (17)  and  the  results  of  Table  1,  the  throughput 
rates  far  the  double  precision  runs  are  258  and  288  MFLOPS  for  matrix  size  256x256  and  512x512, 
respectively,  on  the  16K  CM-2. 

Interprocessor  comnr’nication  associated  with  the  calculation  of  the  dot.  product  of  the  row 
pai~s  [a  in  Eq.  (6))  and  the  systolic  communication  steps  was  timed  to  be  30%  and  22%  of  the 
total  execution  time  for  n  ~  256  and  512,  respectively.  After  accounting  for  the  communication 
time,  the  performance  shown  in  Table  1  is  within  a  factor  of  2  to  2.5  of  C  e  peak-memory-bound- 
performance  of  the  machine. 
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Table  2  —  Execution  time  per  sweep  (seconds).  The  prereleased  slicewise  Fortran  compiler  with  some  unrolling  of 
codes  to  streamline  the  inner  loop  improved  execution  times  by  1.7  and  1.3  times  for  the  smaller  vp  ratios  (n  =  250 
and  512  respectively). 


Double  Precision 

Size 

Fieldwise 

Slicewise,  unrolled 

Fieldwise,  unrolled 

512 

35 

23 

30 

256 

7.5 

3 

5 

Table  2  shows  the  execution  time  per  sweep  in  units  of  seconds  for  the  cases  m  =  n  =  256 
?  <d  512  by  three  versions  of  the  CM  Fortran  code  on  the  16K  CM-2.  The  best  performance  was 
achieved  when  the  matrix  was  laid  out  in  slicewise  mode  and  the  inner  loop  was  unrolled  to  remove 
conditionals  that  fragmented  the  code. 
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Appendix  A 


CM  FORTRAN  CODE  FOR  SUBROUTINE  SVD  OPTIMIZED  FOR  m  ~  n 


N.B.- The  main  subroutine  svd  contains  5  units:  svdcore  contains  the  main  Jacobi  rotation  codes; 
two2one  allocates  arrays  on  the  CM  two  row-elements  per  processor  while  one2two  performs 
the  reverse  process;  evaluatel  calculates  the  S  and  V  matrices;  evaluate2  calculates  the  residual 
error.  In  this  unrolled  version,  svdcore  similar  chunks  of  codes  are  sequentially  repeated  4  times, 
one  slightly  different  from  the  others.  This  is  to  avoid  invoking  if-then  clauses  that  would  otherwise 
fragment  the  resulting  codes. 


subroutine  svd  (ab,ub,vb,sv,m,n,irank, isweep, eps) 

C  Author:  Nhi-Anh  Chu 

C  Connection  Machine  Facility 

C  Code  5153,  Naval  Research  Lab 

C  Nov  9  1990 

C  Revised  Jan  3  1991 

C  ab  —  2m  x  n  matrix  A,  to  be  decomposed  into  singular  values 

C  sv  =  diag  (S)  such  that  A  =  (U  S  Vt) 

C  ab  is  returned  as  (Ut  A)  where  Ut  is  product  of  Jacobi  rotation 

C  matrices  on  (At  A) 

C  ub  —  2m  x  n  matrix  returned  with  Ut 

C  vb  —  2m  x  n  matrix  returned  with  Vt 

C  sv  —  2m-vector  returned  with  diag(S) ,  the  singular  values  of  A 

C  irank  —  integer  returned  with  the  rank  of  A 

C  isweep  —  integer  returned  with  number  of  sweeps  of  the  rotations; 

C  each  sweep  orthogonalize  every  row-pair  combinations  of  A 

C  eps  —  real  number  specifying  the  machine  precision,  used  to  determine 

C  a  "zero" 

integer  m,  n,  irank,  isweep 

real  ab(2*m,  n) ,  ub(2*m,  n) ,  vb(2*m,  n) ,  Bv(2*m) 
real  eps,  deltas 

real  a(m,  n) ,  ap(m,n),  u(m,n) ,  up(m,n),  v(m,n),  vp(m,n) 
read.  a_original(2*m,n) 

cmf$  layout  a(:news,  :news),  ap(:news,  :news),  u(:news,  :news) 

cmf$  layout  up (: news,  :news) ,  v(:news,  :news),  vp(:news,  :news) 

cmf$  layout  ab(:news,  .-news),  ub(:news,  .-news),  vb(:news,  :news) 
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cmf$  layout  sv(:news),  a_original(  mews ,  :nevs) 
interface 

subroutine  one2two(a,ap ,b ,  m,  n) 
integer  m,  n 

real  a(m,n) ,  ap(m,n),  b(2*m,n) 

cmf$  layout  a(:news,  :news) ,  ap(:news,  :news) ,  b(:news,  :news) 
end  interface 

interface 

subroutine  svdcore  (a,  ap,  u,  up,  m,  n,  irank,  isweep,  deltas,  eps) 
integer  irank,  isweep,  m,  n 

real  a(m,n),  u(m,n) ,ap(m,n) ,  up(m,n),  eps, deltas 
cmf$  layout  a( :news, :news) ,  u( :news , :news) 
cmf$  layout  ap( :news , :news) ,  up(:news,  mews) 
end  interface 

interface 

subroutine  two2one(a,ap ,b ,m,n) 
integer  m,  n 

real  a(m,n),  ap(m,n) ,  b(2*m,n) 

cmf$  layout  a(:news,  mews),  ap(:news,  mews),  b(:news,  mews) 
end  interface 

interface 

subroutine  evaluatel  (a, u.v.sv.irank, deltas, m,n) 
integer  m,  n,  irank 

real  a(m,n),  v(m,n),  u(m,n) ,  sv(m),  deltas 
cmf$  layout  a(  mews ,  mews) ,  u(  mews ,  mews)  ,  v(  mews,  mews) 
cmf$  layout  sv(mews) 

end  interface 

interface 

subroutine  evaluate2  (a,u,a_original,m,n) 
integer  m,  n 

real  a(m,n),  u(m,n) ,  a_original(m,n) 
cmf$  layout  a(  mews ,  mews) ,  u(  mews ,  mews) 
cmf$  layout  a_original(  mews ,  mews) 

end  interface 

C - 

a_ original  =  ab 
print*,’ call  one2two’ 
call  CM_timer_clear(l) 
call  CM.timer. start  (1) 
call  one2two(a,  ap,  ab,  m,  n) 
call  one2two(u,  up,  ub,  m,  n) 
call  CM_timer_stop(l) 
call  CM_timer_print(l) 
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print*, ’call  svdcore’ 
call  CM_timer_start(l) 

call  svdcore  (a,  ap,  u,  up,  m,  n,  irank,  isweep,  deltas,  eps) 

call  CM_timer_stop(l) 

call  CM_timer_print(l) 

call  CM_timer_start(l) 

print*, ’call  two2one’ 

call  two2one(a,  ap,  ab,  m,  n) 

call  two2one(u,  up,  ub,  m,  n) 

call  CM_timer_stop(l) 

call  CM_timer_print(l) 

print*, ’call  evaluate’ 

call  CM_timer_start(l) 

call  evaluatel  (ab,ub,vb,sv, irank, deltas, 2*m,n) 

call  evaluate2  (ab,ub,a_original,2*m,n) 

call  CM_timer_print(l) 

print* .done  svd’ 

call  CM_timer_stop(l) 

return 

end  subroutine  svd 

subroutine  svdcore  (a,  ap,  u,  up,  m,  n, irank, isweep, deltas, eps) 
integer  m,  n,  irank,  isweep 

real  a(m,n) ,  u(m,n),  ap(m,n) ,  up(m,n),  eps,  deltas 
C  Main  locals 

real  alpha  (m,n),  beta(m,n),  gamma(o,n),  c(m,  n),  s(m,  a) 
real  norms(m.n),  normsp(m.n) 

C  scalars  to  compute  convergence  criterion 
real  epss ,  Fnorms 
C  temporaries 

real  atemp(m,n),  utemp(m,n),  ntemp(m,n),  ortho(m.n) 
integer  row(m,n) ,  col(m,n),  irotl(m.n),  irot2(m,n) 
logical  rotate(m.n) 

C  loop  control  variables 

integer  m2,  index,  i,  j,  numsweep,  numrotate 
C  constant 

integer  sup ,  sdown 
C  layout  on  the  connection  machine 
cmf$  layout  a( : news ,: news) ,  u( :news , :news) 
cmf$  layout  ap(  mews,  mews) ,  up(:news,  mews) 
cmf$  layout  norms ( mews ,  mews) ,  normspC  mews ,  mews) 

cmf$  layout  alpha(mews,  mews),  beta(  mews ,  mews) ,  gamma  ( mews,  mews) 

cmf$  layout  utemp( mews,  mews) ,  atemp(  mews ,  mews) ,  ntemp(  mews,  mews) 

cmf$  layout  c( mews ,  mews) ,  s( mews,  mews) 

cmf$  layout  row( mews,  mews) ,  col( mews,  mews) 

cmf$  layout  irotl(:news,  mews),  irot2(:news,  mews) 

cmf$  layout  ortho( mews, mews) ,  rotate(:news,  mews) 

C . . . - . 
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C  initialize 

m2  =  2*m 

numsweep  =  isueep 
epss  =  eps*eps 

sdown*  +1  !a(k)< - a(k+l) 

sup  *  -1  !a(k+l)<-a(k) 

norms  =  spread  (sum(a*a,2) ,2,n) 

normsp  *  spread  (sum(ap*ap,2) ,2,n) 

Fnorms  =  sum(norms( : , 1) , 1)  +  sum(normsp( : , 1) , 1) 
deltas  *  epss*Fnorms 

print* , ’Frobenius  norm  squared  *  Fnorms 
print*, 'Square  of  machine  precision  *  Fnorm  *  deltas 
forall  (i=l:m,  j=l:n)  col(i,j)  *  j 
u  =0.0 
up  =  0.0 

forall  (i=l:m,  j=l:n)  row(i,j)  =  2*i-i 

where  (row. eq. col)  u=  1.0 

forall  (i=l:m,  j=l:n)  row(i,j)  =  2*i 

where  (row. eq. col)  up  *1.0 

forall  (i=l:m,  j=l:n)  row(i,j)  =  i 

C - - - . . 

isweep  *  0 

100  continue 

C  start  odd  sweep 

isweep  *  isweep  *1 

norms  *  spread(sum(a*a,2) ,2,n) 

normsp  *  spread(sum(ap*ap ,2) ,2,n) 

C  unroll  loop  by  2 

do  index  *  1 ,  m2 ,  2 

C  start  odd  index 

alpha  *  2*spread(sum(a*ap,2) ,2,n) 
beta  =  norms  -normsp 

gamma  =  sqrt((alpha*alpha)+(beta*beta)) 

ortho  =  0.25*alpha* alpha  -  deltas*min (norms,  normsp) 

rotate  =  (norms .ge .deltas) . and. (normsp .ge .deltas) . and. (ortho .ge .0) 

where  ((beta. ge.0) .and. rotate) 

c  =  3qrt((gamma+beta)/(2.0*gamma)) 
s  =  alpha  /  (2*gamma*c) 
atemp  =  -s*a  +  c*ap 
utemp  *  -s*u  +  c*up 

ntemp  =  s*s*norms  +  c*c*normsp  -  alpha*c*s 
ap  =  c*a  +  s*ap 

up  =  c*u  ♦  s*up 

normsp*  c*c*norms  +  s*s*normsp  ♦  alpha*c*s 
a  *  atemp 

u  =  utemp 

norms  =  ntemp 

endwhere 
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where  ( (beta. It. 0) .and. rotate) 

8  =  sign (sqrt ( (gamma-beta) /  (2 . 0*gamma) )  ,  alpha) 
c  =  alpha  /  (2.0*gamma*s) 
atemp  =  -s*a  ♦  c*ap 
utemp  *  -s*u  ♦  c*up 

ntemp  =  s*s*norms  +  c*c*normsp  -  alpha*c*s 
ap  =  c*a  +  s*ap 
up  =  c*u  +  s*up 

normsp=  c*c*norms  +  s*s*normsp  +  alpha*c*s 
a  =  atemp 
u  =  utemp 
norms  =  ntemp 
endwhere 

where  ((beta.gt .0) .and. ( .not .rotate) ) 
atemp  =  ap 
utemp  =  up 
ntemp  =  normsp 
ap  =  a 
up  =  u 
normsp  =  norms 
a  =  atemp 
u  =  utemp 
norms  =  ntemp 
endwhere 

C  communicate  (a,  ap)  to/from  processors  aligned  with  odd  rows 

atemp  ■  ap 
utemp  =  up 
ntemp  =  normsp 
ap  =cshift(a,  1,  sdown) 
up  =cshift(u,  1,  sdown) 
normsp  =cshift (norms ,  1,  sdown) 
a  =  atemp 
u  =  utemp 
norms  *  ntemp 

C  start  even  index 

alpha  *  2*spread(sum(a*ap,2) ,2,n) 

beta  =  norms  -normsp 

gamma  *  sqrt((alpha*alpha)+(beta*beta)) 

ortho  *  0.25* alpha* alpha  -  deltas*min (norms,  normsp) 

rotate  «  (norms .ge. deltas) .and. (normsp. ge. deltas) .and. (ortho. ge.  0) 

.and. (row.ne.m) 

where  ((beta. ge.O) .and. rotate) 

c  =  sqrt((gamma+beta)/(2.0*gamma))  Icosine  term 

8  *  alpha  /  (2*gamma*c)  !sine  term 

atemp  =  -s*a  +  c*ap 
utemp  =  -8*u  +  c*up 

ntemp  =  s*s*norms  +  c*c*normsp  -  alpha*c*s 
ap  *  c*a  ♦  s*ap 
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up  =  C*U  +  s*up 

normsp=  c*c*norms  +  s*s*normsp  +  alpha*c*s 
a  =  atemp 

u  =  utemp 

norms  =  ntemp 

endwhere 

where  ( (beta. It. 0) .and. rotate) 

s  =  sign(sqrt( (gamma-beta) / (2. 0*gamma))  ,  alpha) 
c  *  alpha  /  (2.0*gamma*s) 
atemp  -  -s*a  +  c*ap 
utemp  =  -s*u  +  c*up 

ntemp  =  s*s*norms  +  c*c*normsp  -  alpha*c*s 
ap  =  c*a  +  s*ap 
up  =  c*u  +  s*up 

normsp=  c*c*norms  +  s*s*normsp  +  alpha*c*s 
a  =  atemp 
u  =  utemp 
norms  =  ntemp 
endwhere 

where  ((beta.gt.O) .and. ( .not .rotate) .and. (row.ne.m)) 
atemp  =  ap 
utemp  =  up 
ntemp  =  normsp 
ap  =  a 
up  =  u 
normsp  *  norms 
a  *  atemp 
u  *  utemp 
norms  =  ntemp 
endwhere 

C  communicate  (a,  ap)  to/from  processors  aligned  with  odd  rows 

atemp  =  a 
utemp  =  u 
ntemp  =  norms 
a  =cshift(ap,  1,  sup) 
u  =cshift(up,  1,  sup) 
norms  =cshift(normsp,  1,  sup) 
ap  *  atemp 
up  =  utemp 
normsp  =  ntemp 
enddo  !  end  odd  sweep 
C  start  even  sweep 

C  number  of  rotations  kept  in  irotl  and  irot2 

isweep  =  isweep  +1 
irotl  =0 
irot2  =0 

do  index  =  1 ,  m2 ,  2 
C  start  odd  index 
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alpha  =  2*spread(sum(a*ap,2) ,2,n) 

beta  =  normsp  -norms 

gamma  =  sqrt((alpha*alpha)+(beta*beta)) 

ortho  =  0.25*alpha*alpha  -  deltasbmin(norms ,  normsp) 

rotate  ■  (norms. ge. deltas) .and. (normsp. ge. deltas) .and. (ortho. ge.O) 

where  ((beta.ge .0) .and. rotate) 

c  =  sqrt((gamma+beta)/(2.0*gamma))  icosine  term 

s  =  alpha  /  (2*gamma*c)  Isine  term 

atemp  =  s*a  +  c*ap 
utemp  =  s*u  +  c*up 

ntemp  =  s*s*norms  +  c*c*normsp  +  alphabet's 
ap  =  c*a  -  s*ap 
up  =  c+u  -  s*up 

normsp=  c*c*norms  +  s*s*normsp  -  alpha*c*s 
a  =  atemp 
u  =  utemp 
norms  =  ntemp 
irotl  =  irotl  +1 
endwhere 

where  ( (beta . It . 0) . and . rotate) 

s  *  sign(sqrt((gamma-bets.)/(2.0*gamma))  ,  alpha) 
c  =  alpha  /  (2.0*gamma*s) 
atemp  *  s*a  +  c*ap 
utemp  =  8*u  +  c*up 

ntemp  =  s*s*norms  +  c*c*normsp  +  alpha*c*s 
ap  *  c*a  -  s*ap 
up  =  c*u  -  s*up 

normsp=  c*c*norms  +  s*sbnormsp  -  alpha*c*s 
a  =  atemp 
u  =  utemp 
norms  =  ntemp 
irotl  =  irotl  +1 
endwhere 

where  ((beta.gt.O) .and. ( .not. rotate)) 
atemp  =  ap 
utemp  =  up 
ntemp  =  normsp 
ap  =  a 
up  =  u 
normsp=  norms 
a  =  atemp 
u  =  utemp 
norms  =  ntemp 
endwhere 

C  communicate  (a,  ap)  to/from  processors  aligned  with  odd  rows 

atemp  =  ap 
utemp  -  up 
ntemp  =  normsp 
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ap  =  cshift(a,  1,  sdown) 

up  =  cshift(u,  1,  sdown) 

normsp®  cshift (norms ,  1,  sdown) 
a  =  atemp 

u  =  utemp 

norms  *  ntemp 

C  end  odd  index  of  even  sweep 

C  start  even  index 

alpha  =  2*spread(sum(a*ap,2) ,2,n) 

beta  =  normsp  -norms 

gamma  =  sqrt((alpha*alpha)+(beta*beta)) 

ortho  =  0.25*alpha*alpha  -  deltas*min(norms ,  normsp) 

rotate  =  (norms .ge .deltas) . and. (normsp .ge .deltas) . and. (ortho .ge .0) 

. and. (row .ne .m) 

where  ((beta. ge.O) .and. rotate) 

c  *  sqrt((gamma+beta)/(2.0*gamma))  icosine  term 

8  =  alpha  /  (2*gamma*c)  !sine  term 

atemp  *  s*a  +  c*ap 
utemp  =  s*u  +  c*up 

ntemp  =  s*s*norms  +  c*c*normsp  +  alpha*c*s 
ap  =  c*a  -  s*ap 
up  *  c*u  -  s*up 

normsp=  c*c*norms  +  s*s*normsp  -  alpha*c*s 
a  =  atemp 
u  *  utemp 
norms  *  ntemp 
irot2  =  irot2  +1 
endwhere 

where  ( (beta. It. 0) .and. rotate) 

8  *  sign(sqrt( (gamma-beta)/ (2. 0*gamma))  ,  alpha) 
c  =  alpha  /  (2.0*gamma*s) 
atemp  =  s*a  +  c*ap 
utemp  =  s*u  +  c*up 

ntemp  =  s*s*norms  +  c*c*normsp  +  alpha*c*s 
ap  =  c*a  -  s*ap 
up  =  c*u  -  s*up 

normsp®  c*c*norms  +  s*s*normsp  -  alpha*c*s 
a  »  atemp 
u  =  utemp 
norms  =  ntemp 
irot2  =  irot2  +1 
endwhere 

where  ((beta.gt .0) .and. ( .not .rotate) . and. (row.ne.m) ) 
atemp  =  ap 
utemp  =  up 
ntemp  =  normsp 
ap  =  a 
up  =  u 
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normsp=  norms 
a  =  atemp 
u  =  utemp 
norms  =  ntemp 
endwhere 

C  communicate  (a,  ap)  to/from  processors  aligned  with  odd  rows 

atemp  =  a 
utemp  =  u 
ntemp  =  norms 
a  =  cshift(ap,  1,  sup) 
u  =  cshift(up,  1,  sup) 
norms  =  cshift(normsp ,  1,  sup) 
ap  =  atemp 
up  =  utemp 
normsp=  ntemp 
enddo  !  end  even  sweep 
irotl  =  irotl  +  irot2 
numrotate  =  sum(irotl(l :m, 1) , 1) 

print*,’  sweep  ’,  isweep,  *  ’.numrotate,’  rotations’ 
if  (numrotate. eq.O)  goto  300 
if  ( isweep. eq.numsweep)  goto  300 
goto  100 
300  continue 

print*, ’done  rotation. .. calculating  singular  values...’ 
return 

end  subroutine  svdcore 


subroutine  evaluatel  (a, u,v,sv,irank, deltas, m,n) 
integer  m,  n,  irank 

real  a(m,n),  v(m,n) ,  u(m,n),  sv(m) ,  deltas 
integer  row(m,n),  col(m,n),  one(m),  ier,  i,  j 
real  oned(m) ,  tl(m,m),  t2(m,m),  t3(m,m) 
cmf$  layout  a( :news , :news) ,  u( :news , inews) ,  v( mews, mews) 
cmf$  layout  sv(mews) 

cmf$  layout  row  (mews,  mews) ,  col(  mews,  mews) 

cmf$  layout  one(mews),  oned(mews) 

cmf$  layout  tl(  mews ,  mews) ,  t2(  mews,  mews) ,  t3(  mews ,  mews) 
C  calculate  singular  values  and  rank 

oned  =  sum(a*a,2) 
sv  =  sqrt(oned) 
where  (oned. gt. deltas) 
one  *1 
elsewhere 
one  =0 
sv  =  0.0 
endwhere 

irank  =  sum  (one(l:m)) 
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C  calculate  v 

t3  *  spread  (sv,2,m) 
v  *  0.0 

where  (t3(l :m, 1 :n) .gt .0)  v(l:m,  lm)  =  a(l:m,  l:n)/t3(l:m,  l:n) 

C  debug  codes  beyond  the  next  statement 

return 

C  detailed  check 

print*,’max  min  of  v’ 

print  *,  maxval(v(l :m,  l:n)),  minval(v(l  :m,  l:n)) 

C  evaluate  VtV 

tl  =  0.0 
t2  =  0.0 

tl(l:m,  l:n)  =  v(l:m,  l:n) 
t2  *  transpose(tl) 
t3  =  0.0 

t3  *  matmul  (tl,  t2) 
forall  (i=l:m,  j=l:n)  row(i,j)=  i 
forall  (i=l:m,  j=l:n)  col(i,j)=  j 
print* , ’maxval  VVt  off  diagonal 

.  maxval(abs(t3(l:n,l:n)) ,  mask=(row.ne.col)) 

C  evaluate  UUt 

tl  *  0.0 
t2  *  0.0 

tl(l:m,  l:n)  *  u(l:m,  l:n) 
t2  =  transpose(tl) 
t3  =  0.0 

t3  =  matmul  (tl,  t2) 

print* , ’maxval  UUt  off  diagonal  ’, 

.  maxval(abs(t3(l:n,l:n)) ,  mask=(row.ne.col)) 

100  return 

end  subroutine  evaluate 1 

subroutine  one2two(a,ap,b,m,n) 
integer  m,  n 

real  a(m,n),  ap(m,n),  b(2*m,n) 

cmf$  layout  a(:nevs,  inews),  ap(:news,  mews),  b(:news,  mews) 

forall  (i=l:m,  j=l:n)  a(i,j)=  b(2*(i-l)  +1,  j) 

forall  (i=l:m,  j*l:n)  ap(i,j)«  b(2*i,  j) 
return 
end 

subroutine  two2one(a,ap,b,m,n) 
integer  m,  n 

real  a(m,n) ,  ap(m,n),  b(2*m,n) 

cmf$  layout  a(:news,  mews),  ap(mews,  mews),  b(:news,  mews) 

forall  (i=l:2*m-l:2,  j=l:n)  b(i,j)=  a(l+((i-l)/2) ,  j) 
forall  (i=2:2*m:2,  j=l:n)  b(i,j)=  ap(l+((i-l)/2) , j) 
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return 

end 

subroutine  evaluate2  (a,u,a_original,m,n) 
integer  m,  n,  irank 
real  a(m,n) ,  u(m,n),  a_original(m,n) 
real  tl(m,m),  t2(m,m) ,  t3(m,m) 
cmf$  layout  a( :news , :nevs) ,  u( :news , :news) 

cmf$  layout  a_original( :news ,  :news) 

cmf $  layout  t 1 ( mews, :news) ,  t2( :news , :news) ,  t3( mews , :news) 

C 

C  evaluate  USVt 

tl  =  0.0 
t2  =  0.0 
t2  (1 :m, 1 :n)  =  a 
tl(l:m,  l:n)  =  u(l:m,  l:n) 
tl  =  transposeCtll 
t3  =  0.0 

t3  =  matmul  (tl,  t2) 

t3(l:m,l:n)  =  t3(l:m,  l:n)  -  a_original 

print*, ’error  =  max(abs(  U  S  Vt  -  A  ))  is  * ,  maxval(abs(t3) ) 
return 

end  subroutine  evaluate2 
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CM  FORTRAN  CODE  FOR  SUBROUTINE  SVD  OPTIMIZED  FOR  THE  CASE 

m  >  n 


A'. R-The  main  subroutine  svd  contains  7  units:  svdcore  contains  the  main  Jacobi  rotation  codes; 
two2one  allocates  arrays  on  the  CM  two  row-elements  per  processor  while  one2two  performs 
the  reverse  process;  evaluatel  calculates  the  S  and  V  matrices;  evaluatc2  calculates  the  residual 
error;  p2one  allocates  an  rn  x  m  array  into  ^  arrays,  each  ~  x  n,  while  one2p  performs  the 
reverse.  In  this  unrolled  version  of  svdcore  similar  chunks  of  codes  are  lepeated  4  times,  each 
slightly  different  from  tne  other.  This  is  to  avoid  invoking  if-then  clauses  that  would  otherwise 
fragment  the  resulting  codes.  Further,  calculations  involving  the  matrix  U  is  carried  out  in  a 
^-times  do  loop. 


subroutine  svd  (ab,ub,vb,sv,p,m,n,irank,isweep,epe) 

C  Author:  Nhi-Anh  Chu 

C  Connection  Machine  Facility 

C  Code  5153,  Naval  Research  Lab 

C  Nov  9  1990 

C  Revised  Jan  3  1991 

C  p  =  2*int(m/n) 

C  ab  —  2m  x  n  matrix  A,  to  be  decomposed  into  singular  values 

C  sv  =  diag  (S)  such  that  A  =  (U  S  Vt) 

C  ab  is  returned  as  (Ut  A)  where  Ut  is  prod”Ct  of  Jacobi  rotation 

C  matrices  on  (At  A) 

C  ub  —  2m  x  2m  matrix  returned  with  Ut 

C  vb  —  2m  x  n  matrix  returned  with  Vt 

C  sv  —  2m-vector  returned  with  diag(S) ,  the  singular  values  of  A 

C  irank  —  integer  returned  with  the  rank  of  A 

C  iswe<jp  —  integer  returned  with  number  of  sweeps  of  the  rotations; 

C  each  sweep  orthogonalize  every  row-pair  combinations  of  A 

C  eps  --  real  number  specifying  the  machine  precision,  used  to  determine 

C  a  "zero" 

integer  p,  m,  n,  irank,  isweep 

real  ab(2*m,  n) ,  ub(2*m,  p*n) ,  vb(2*m,  n) ,  sv(2*m) 
real  eps,  deltas 

real  a(m,  n) ,  ap(m,n),  u(p,m,n),  up(p,m,n),  v(m,n),  vp(m,n) 
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real  a_original(2*m,n) 

cmf$  layout  a(:nevs,  mews),  ap(:news,  :news) ,  u(: serial,  : news ,  mews) 
cmf$  layout  up(:serial,  mews,  :neus),  v(:news,  :news) ,  vp(:nevs,  mews) 

cmf$  layout  ab(:news,  mews),  ub(:news,  mews),  vb(mews,  mews) 

cmf$  layout  sv(:news),  a_original( mews ,  mews) 

interface 

subroutine  one2two(a,ap,b,  m,  n) 
integer  m,  n 

read  a(m,n),  apCm.n),  b(2*m,n) 

cmf$  layout  a(:news,  mews),  ap(mews,  mews),  b(:news,  mews) 
end  interface 

interface 

subroutine  svdcore  (a,  ap,  u,  up,  p,  m,  n,  irank,  isweep,  deltas,  eps) 
integer  irauik,  isweep,  p,  m,  n 

real  a(m,n),  u(p,m,n) ,ap(m,n) ,  up(p,m,n),  eps, deltas 
cmf$  layout  a( mews ,  mews)  ,  u(:serial,  mews,  mews) 
cmf$  layout  ap( mews,  mews) ,  up(:serial,  mews,  mews) 
end  interface 

interface 

subroutine  two2one(a,ap,b,m,n) 
integer  m,  n 

real  a(m,n),  ap(m,n),  b(2*m,n) 

cmf$  layout  a(:news,  mews),  ap(:news,  mews),  b(:news,  mews) 
end  interface 

interface 

subroutine  p2one(u,up,ub,p,m,n) 
integer  p,  m,  n 

read  u(p,m,n),  up(p,m,n),  ub(2*m,p*n) 
cmf$  layout  u(: serial,  mews,  mews),  up(: serial,  mews,  mews) 

cmf$  layout  ub(:news,  mews) 

end  interface 

interface 

subroutine  one2p(u,up,ub,p,m,n) 
integer  p,  m,  n,  k 

real  u(p,m,n),  up(p,n,n),  ub(2*o,p*n) 
cmf$  layout  u(:serial,  mews,  mews),  up(:serial,  mews,  mews) 

cmf$  layout  ub(mews,  mews) 

end  interface 

interface 

subroutine  evaluatel  (a, u,v,sv,irank, deltas, m,n) 
integer  m,  n,  irank 

real  a(m,n),  v(m,n),  u(m,m) ,  sv(n) ,  deltas 
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cmf$  layout  a( :news , :news) ,  u( mews, :news) ,  v( mews,  mews) 
cmf$  layout  sv(:news) 

end  interface 

interface 

subroutine  evaluate2  (a,u, a.original ,m,n) 
integer  m,  n 

real  a(m,n),  u(m,n) ,  a_original(m,n) 
cmf$  layout  a(  rnews ,  rnews)  ,  u(  .-news, : news) 
cmf$  layout  a_original(  mews ,  mews) 

end  interface 

C - - - - - - - - 

C  make  sure  that  p  is  2*m/n 

if  (p.ne. (2*m/n))  stop  ’p  must  be  equal  to  m/n* 

aboriginal  =  ab 

print* , *  call  one2two ’ 

call  CM_timer_clear(l) 

call  CM_timer_start  (1) 

call  one2two(a,  ap,  ab,  m,  n) 

call  one2p(u,  up,  ub,  p,  m,  n) 

call  CM_timer_8top(l) 

call  CM_timer_print(l) 

print* , ’ call  svdcore ’ 

call  CM_timer_8tart(l) 

call  svdcore  (a,  ap,  u,  up,  p,  m,  n,  irank,  isweep,  deltas,  epB) 

call  CH_timer_stop(l) 

call  CM_timer_print(l) 

call  CM_timer_start(l) 

print* , *  call  two2one ’ 

call  two2one(a,  ap,  ab,  m,  n) 

call  p2one(u,  up,  ub,  p,  m,  n) 

call  CM_timer_stop(l) 

call  CM_timer_print(l) 

print* , ’ call  evaluate ’ 

call  CM_timer_8tart(l) 

call  eva)uatol  (ab,ub,vb,sv, irank, deltas, 2*m,n) 

call  evaluate2  (ab,ub,a_original,2*m,n) 

call  CM_timer_print(l) 

print* , ’ . . . done  svd ' 

call  CM_timer_stop(l) 

return 

end  subroutine  svd 

subroutine  svdcore  (a,  ap,  u,  up,  p,  m,  n, irank, isweep, deltas, eps) 
integer  p,  m,  n,  irank,  isweep 

real  a(m,n),  u(p,m,n),  ap(m,n),  up(p,m,n),  eps,  deltas 
C  Main  locals 

real  alpha  (m,n),  beta(m,n),  gamma(m,n),  c Cm,  n) ,  s(m,  n) 
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real  norms(m,n),  normsp(m,n) 

C  scalars  to  compute  convergence  criterion 
real  epss,  Fnorms 
C  temporaries 

real  atemp(m,n),  utempCp,  m,n) ,  ntemp(m,n),  ortho(m.n) 
integer  rov(m,n),  col(m,n),  irotl(m,n),  irot2(m,n) 
logical  rotate(m.n) 

C  loop  control  variables 

integer  m2,  index,  k,  i,  j,  numsweep,  numrotate 
C  constant 

integer  sup,  sdown 
C  layout  on  the  connection  machine 

cmf$  layout  a(  mews,  mews) ,  u(: serial,  mews,  mews) 
cmf$  layout  ap( mews ,  mews) ,  up(: serial,  mews,  mews) 
cmf$  layout  norms ( mews,  mews)  ,  normspC mews,  mews) 

cmf$  layout  alpha(mews,  mews),  beta(  mews,  mews) ,  gaanma(  mews,  mews) 

cmf$  layout  utemp(: serial,  mews, mews) 

cmf $  layout  atemp(  mews ,  mews)  ,  ntemp(  mews ,  mews) 

cmf$  layout  c( mews,  mews) ,  s( mews,  mews) 

cmf$  layout  row( mews news) ,  col( mews,  mews) 

cmf$  layout  irotl(:news,  mews),  irot2(:news,  mews) 

cmf$  layout  ortho(  mews,  mews) ,  rotateC mews ,  mews) 

C - 

C  initialize 

m2  *  2*m 

numsweep  *  isweep 
if  (m2.1t.n)  then 

print* , ’There  must  be  no  more  columns  than  rows.  ’ 
print* , ’Transpose  the  matrix’ 
irank  =  0 
stop 
end  if 

epss  =  eps*ep8 

sdown=  +1  !a(k)< - a(k+l) 

sup  =  -1  !a(k+l)<-a(k) 

norms  *  spread  (sum(a*a,2) ,2,n) 

normsp  *  spread  (sum(ap*ap,2) ,2,n) 

Fnorms  ■  sum(norms( : , 1) , 1)  ♦  sum(normsp( : , 1) , i) 
deltas  *  epss*Fnorms 

print*, ’Frobenius  norm  squared  =  ’,  Fnorms 

print*, ’Square  of  machine  precision  *  Fnorm  =  ’,  deltas 

u  *0.0 

up  *  0.0 

do  k  *1,  p 

forall  (i=l:m,  j=l:n)  col(i,j)  =  j+  (k-l)*n 
forall  (i=l:m,  j=l m)  row(i,j)  *  2*i-l 
where  (row. eq. col)  u(k,:,:)*  1.0 
forall  (i=l:m,  j=l:n)  row(i,j)  =  2*i 
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where  (row. eq. col)  up(k,:,:)  *1.0 
enddo 

forall  (i*l:m,  j*l:n)  row(i,j)  *  i 


isweep  =  0 

100  continue 

C  start  odd  sweep 

isweep  =  isweep  +1 

norms  =  spread(sum(a*a,2) ,2,n) 

normsp  =  spread(sum(ap*ap,2) ,2,n) 

C  unroll  loop  by  2 

do  index  =  1,  m2,  2 

C  start  odd  index 

alpha  =  2*spread(sum(a*ap,2) ,2,n) 
beta  =  norms  -normsp 

gamma  *  sqrt((alpha*alpha)+(beta*beta)) 

ortho  *  0.25*alpha*alpha  -  deltas*min(norms ,  normsp) 

rotate  =  (norms. ge. deltas) .and. (normsp. ge. deltas) .and. (ortho. ge.O) 

where  ((beta. ge.O) .and. rotate) 

c  =  sqrt((gamma+beta)/(2.0*gamma)) 
s  =  alpha  /  (2*gamma*c) 
atemp  *  -s*a  +  c*ap 

ntemp  *  s*s*norms  ♦  c*c*normsp  -  alpha*c*s 
ap  =  c*a  +  s*ap 

normsp*  c*c*norms  +  s*8*normsp  +  alpha*c*s 
a  =  atemp 
norms  =  ntemp 
endwhere 

where  ( (beta.lt .0) .and. rotate) 

s  =  sign(sqrt((gamma-beta)/(2.0*gamma))  ,  alpha) 
c  *  alpha  /  (2.0*gamma*s) 
atemp  =  -s*a  +  c*ap 

nt-emp  =  s*s*norms  +  c*c*normsp  -  alpha*c*s 
ap  =  c*a  +  s*ap 

normsp*  c*c*norms  +  s*s*normsp  +  alpha*c*s 
a  *  atemp 
norms  *  ntemp 
endwhere 
do  k=l,p 
where  (rotate) 

utemp(k,:,:)  *  -s*u(k,:,:)  +  c*up(k,:,:) 
up(k, : , : )  *  c*u(k, : , : )  +  s*up(k,:,:) 

u(k, : , :)  *  utemp(k, : , :) 

endwhere 
enddo 

where  ((beta.gt.O) .and. (.not. rotate)) 
atemp  *  ap 
ntemp  *  normsp 
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ap  *  a 

normsp  *  norms 
a  =  atemp 

norms  =  ntemp 
endvhere 
do  k  =1 ,p 

where(.(beta.gt.O)  .and.  (.not. rotate)) 
utemp(k, : , : )  =  up(k,:,:) 
up(k, : , :)  =  u(k, : , :) 

u(k,:,:)  =  utemp(k, : , :) 

endwhere 
enddo 

C  communicate  (a,  ap)  to/from  processors  aligned  with  odd  rows 

atemp  =  ap 
ntemp  =  normsp 
ap  =cshift(a,  1,  sdown) 
normsp  =cshift (norms ,  1,  sdown) 
a  =  atemp 
norms  *  ntemp 
do  k=l  ,p 

utemp(k,:,:)  ■  up(k,:,:) 
up(k,:,:)  »cshift(u(k, : , :) ,  1,  sdown) 
u(k,:,:)  *  utemp(k, : , :) 
enddo 

C  start  even  index 

alpha  *  2*spread(sum(a*ap,2) ,2,n) 

beta  =  norms  -normsp 

gamma  *  sqrt((alpha*alpha)+(beta*beta)) 

ortho  =  0.25*alpha*alpha  -  deltas*min (norms,  normsp) 

rotate  *  (norms. ge. deltas) .and. (normsp. ge. deltas) .and. (ortho. ge.O) 

. and. (row.ne.m) 

where  ((beta. ge.O) .and. rotate) 

c  *  sqrt((gamma+beta)/(2.0*gamma))  ! cosine  term 

s  *  alpha  /  (2*gamma*c)  !sine  term 

atemp  =  -s*a  +  c*ap 

ntemp  =  s*s*norms  +  c*c*normsp  -  alpha*c*s 
ap  =  c*a  +  s*ap 

normsp=  c*c*norms  +  s*s*normsp  +  alpha*c*s 
a  *  atemp 

norms  *  ntemp 

endwhere 

where  ( (beta. It. 0) .and. rotate) 

s  *  sign(sqrt((gamma-beta)/(2.0*gamma))  ,  alpha) 
c  »  alpha  /  (2.0*gamma*s) 
atemp  *  -s*a  +  c*ap 

ntemp  =  s*s*norms  +  c*c*normsp  -  alpha*c*s 
ap  =  c*a  +  s*ap 

normsp=  c*c*norms  +  8*s*normsp  +  alpha*c*s 
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a  =  atemp 
norms  =  ntemp 
endwhere 
do  k=l  ,p 

where  (rotate) 

utemp(k,:,:)  =  -s*u(k,:,:)  +  c*up(k,:,:) 
up(k,:,:)  =  c*u(k,:,:)  +  s*up(k,:,:) 

u(k,;,:)  =  utempCk,:,:) 

endwhere 
enddo 

where  ((beta.gt .0) .and. ( .not .rotate) .and. (row.ne.m)) 
atemp  =  ap 
ntemp  =  normsp 
ap  ■  a 

normsp  =  norms 
a  -  atemp 

norms  =  ntemp 
endwhere 
do  k=l,p 

where  ((beta.gt.O) .and. (.not. rotate) .and. (row.ne.m)) 
utemp(k,:,:)  =  up(k,:,:) 
up(k, : , :)  =  u(k, : , :) 

u(k,:,:)  =  utemp(k, : , :) 

endwhere 
enddo 

C  communicate  (a,  ap)  to/from  processors  aligned  with  odd  rows 

atemp  =  a 
ntemp  *  norms 
a  =cshift(ap,  1,  sup) 
norms  =cshift (normsp ,  1,  sup) 
ap  =  atemp 
normsp  =  ntemp 
do  k=l ,p 

utemp(k,:,:)  =  u(k,:,:) 
u(k,:,:)  =cshift(up(k, : , :) ,  1,  sup) 
up(k,:,:)  =  utemp(k,:,:) 
enddo 

enddo  !  end  odd  sweep 
C  start  even  sweep 

C  number  of  rotations  kept  in  irotl  and  irot2 

isweep  =  isweep  +1 
irotl  =0 
irot2  =0 

do  index  *  1,  m2,  2 
C  start  odd  index 

alpha  =  2*spread(sum(a*ap,2) ,2,n) 
beta  *  normsp  -norms 

gamma  =  sqrt((alpha*alpha)+(beta*beta)) 
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ortho  *  0.25*alpha*alpha  -  deltas*min(norms ,  normsp) 

rotate  =  (norms .ge. deltas) .and. (normsp. ge. deltas) .and. (ortho. ge.O) 

where  ((beta. ge.O) .and. rotate) 

c  =  sqrt((gamma+beta)/(2.0*gamma))  ! cosine  term 

s  =  alpha  /  (2*gamma*c)  !sine  term 

atemp  *  s*a  +  c*ap 

ntemp  =  s*s*norms  +  c*c*normsp  +  alpha*c*s 
ap  =  c*a  -  s*ap 

normsp=  c*c*norms  +  s*s*normsp  -  alpha*c*s 
a  =  atemp 
norms  =  ntemp 
irotl  =  irotl  +1 
endwhere 

where  ((beta. It. 0) .and. rotate) 

s  *  sign(sqrt( (gamma-beta)/ (2. 0*gamma))  ,  alpha) 
c  *  alpha  /  (2.0*gamma*s) 
atemp  *  s*a  +  c*ap 

ntemp  =  s*s*norms  +  c*c*normsp  +  alpha*c*s 
ap  =  c*a  -  s*ap 

normsp*  c*c*norms  +  s*s*normsp  -  alpha*c*s 
a  =  atemp 
norms  *  ntemp 
irotl  *  irotl  +1 
endwhere 
do  k=l,p 

where  (rotate) 

utemp(k, : , :)  =  s*u(kf:,:)  +  c*up(k,:,:) 

up(k,:,:)  *  c*u(k,:,:)  -  B*up(k,:,:) 

u(k,:,:)  *  utemp(k, : , :) 

endwhere 
enddo 

where  ((beta.gt.O) .and. ( .not. rotate) ) 
atemp  *  ap 
ntemp  =  normsp 
ap  *  a 
normsp*  norms 
a  *  atemp 
norms  *  ntemp 
endwhere 
do  k  *l,p 

where  ((beta.gt.O) .and. ( .not .rotate)) 
utemp(k,:,:)  =  up(k,:,:) 

up(k, : , :)  =  u(k, : , :) 

u(k, : , :)  *  utemp(k, : , :) 

endwhere 
enddo 

communicate  (a,  ap)  to/from  processors  aligned  with  odd  rows 
atemp  *  ap 
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ntemp  =  normsp 

ap  =  cshift(a,  1,  sdown) 

normsp=  cshift (norms ,  1,  sdown) 

a  =  atemp 

norms  =  ntemp 

do  k=l,p 

utemp(k,:,:)  =  up(k,:,:) 
up(k,:,:)  =cshift(u(k, : , :) ,  1,  sdown) 
u(k,:,:)  =  utemp(k, : , :) 
enddo 

C  end  odd  index  of  even  sweep 

C  start  even  index 

alpha  =  2*spread(sum(a*ap,2) ,2,n) 

beta  =  normsp  -norms 

gamma  *  sqrt((alpha*alpha)+(beta*beta)) 

ortho  =  0.25*alpha*alpha  -  deltas*min (norms,  normsp) 

rotate  =  (norms. ge. deltas) .and. (normsp. ge. deltas) .and. (ortho. ge.O) 

.and. (row.ne.m) 

where  ((beta. ge.O) .and. rotate) 

c  *  sqrt((gamma+beta)/(2.0*gamma))  ! cosine  term 

s  *  alpha  /  (2*gamma*c)  !sine  term 

atemp  =  s*a  ♦  c*ap 

ntemp  *  s*s*norms  ♦  c*c*normsp  +  alpha*c*s 
ap  =  c*a  -  s*ap 

normsp=  c*c*norms  +  s*s*normsp  -  alpha*c*s 
a  =  atemp 
norms  *  ntemp 
irot2  =  irot2  +1 
endwhere 

where  ( (beta. It. 0) .and. rotate) 

s  =  sign(sqrt( (gamma-beta)/ (2. 0*gamma))  ,  alpha) 
c  *  alpha  /  (2.0*gamma*s) 
atemp  ■  s*a  +  c*ap 

ntemp  =  s*s*norms  +  c*c*normsp  +  alpha*c*s 
ap  *  c*a  -  s*ap 

normsp=  c*c*norms  +  s*s*normsp  -  alpha*c*s 
a  =  atemp 
norms  *  ntemp 
irot2  =  irot2  +1 
endwhere 
do  k=l  ,p 

where  (rotate) 

utemp(k,:,:)  *  s*u(k,:,:)  +  c*up(k,:,:) 
up(k,:,:)  =  c*u(k,:,:)  -  s*up(k,:,:) 

u(k,:,:)  =  utemp(k, : , :) 

endwhere 
enddo 

where  ((beta.gt .0) .and. ( .not .rotate) .and. (row.ne .m)) 
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atemp  =  ap 
ntemp  =  normsp 
ap  =  a 
normsp =  norms 
a  =  atemp 
norms  =  ntemp 
endwhere 
do  k=l,p 

where  ((beta.gt .0) .and. (.not .rotate) . and. (row.ne.m)) 
utemp(k,:,:)  =  up(k,:,:) 
up(k, : , :)  =  u(k, : , :) 

u(k, : , =  utempCk, : , 
endwhere 
enddo 

C  communicate  (a,  ap)  to/from  processors  aligned  with  odd  rows 

atemp  -  a 
ntemp  ■  norms 
a  *  cshiftCap,  1,  sup) 
norms  *  cshift (normsp,  1,  sup) 
ap  *  atemp 
normsp=  ntemp 
do  k=l,p 

utemp(k,:,:)  =  u(k,:,:) 

u(k, : , :)  »  cshift(up(k, : , :) ,  1,  sup) 

up(k, : , :)  ■  utemp (k , : , :) 

enddo 

enddo  !  end  even  sweep 
irotl  =  irotl  +  irot2 
numrotate  *  sum(irotl(l :m, 1) , 1) 

print*,’  sweep  ’,  isweep,  ’  ’.numrotate,’  rotations’ 
if  (numrotate. eq.O)  goto  300 
if  (isweep. eq.numsweep)  goto  300 
goto  100 
300  continue 

print*, ’done  rotation. . .calculating  singular  values...’ 
return 

end  subroutine  svdcore 

subroutine  evaluatel  (a, u,v,sv,irank, deltas, m,n) 
integer  m,  n,  irank 

real  a(m,n),  v(n,n),  u(m,m),  sv(n),  deltas 
integer  row(m,m),  col(m,m),  i,  j 
real  tl(m,m),  t2(n,n) ,  t3(n,n) 

layout  a( :news, :news) ,  u(:news, inews) ,  v( mews, mews) 
layout  8v(:news) 

layout  row( mews,  mews) ,  col ( mews,  mews) 
layout  tl(:new8,  mews) ,  t2( mews,  mews)  ,  t3(  mews,  mews) 


cmf$ 
cmf  $ 
cmf  $ 
cmf  $ 
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C  calculate  singular  values  and  rank 

t3  =  a(l:n,  l:n) 
sv  =  sqrt(sum(t3*t3,2)) 
irank  =  count(sv.gt.sqrt(deltas)) 

C  calculate  v 

t2  =  spread  (sv,2,n) 
v  =  0.0 

where  (t2.gt.O)  v  =  t3/t2 
C  debug  codes  beyond  the  next  statement 

C  return 

C  detailed  check 

print*, 'max  min  of  v’  .maxval(v),  minval(v) 

C  evaluate  VtV 

t2  =  matmul  (transpose (v) ,  v) 

forall  (i=l:m,  j*i:m)  row(i,j)=  i 

forall  (i=l:m,  j=l:m)  col(i,j)=  j 

print*, ’max val  VVt  off  diagonal  ’,  maxval(abs(t2) , 

.  mask=(row(l:n,l:n) .ne.col(i:n,l:n))) 

C  evaluate  UUt 

tl  =  matmul  (transpose (u) ,  u) 
print*, ’maxval  UUt  off  diagonal  ’, 

.  maxval (abs (t 1) ,  mask=(row.ne.col)) 

100  return 

end  subroutine  evaluatel 

subroutine  one2two(a,ap,b,m,n) 
integer  m,  n 

real  a(m,n) ,  ap(m,n) ,  b(2*m,n) 

cmf$  layout  a(:new3,  :news) ,  ap(:news,  inews),  b(:news,  mews) 
forall  (i=l:m,  j=l:n)  a(i,j)=  b(2*(i-l)  +1,  j) 

forall  (i=l:m,  j=l:n)  ap(i,j)=  b(2*i,  j) 
return 
end 


subroutine  two2one(a,ap,b,m,n) 
integer  m,  n 

real  a(m,n),  ap(m,n),  b(2*m,n) 

cmf$  layout  a(:news,  mews),  ap(:nevs,  mews),  b(:news,  mews) 
forall  (i=l:2*m-l:2,  j«l:n)  b(i,j)=  a(i+((i-l)/2) ,  j) 
forall  (i=2:2*m:2,  j=l:n)  b(i,j)=  ap(l+((i-l)/2) , j) 

return 
end 

subroutine  one2p(u,up,ub,p,m,n) 
integer  p,  m,  n,  X 

real  u(p,m,n),  up(p,m,n),  ub(2*m,p*n) 
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cmf$  layout  u(:serial,  mews,  mews),  up(:serial,  mews,  mews) 
cmf$  layout  ub(:news,  mews) 
do  k  =l,p 

forall  (i=l :2*m-l :2,  j=k:(k+n-l))  ub(i,j)=  u(k,l+((i-l)/2) ,  j) 
forall  (i=2:2*m:2,  j=k: (k+n-1) )  ub(i,j)=  up(k,l+((i-l)/2) , j) 
enddo 
return 
end 

subroutine  p2one(u,up,ub,p,m,n) 
integer  p,  m,  n,  k 

real  u(p,m,n),  up(p,m,n),  ub(2*m,p*n) 
cmf$  layout  u(: serial,  mews,  mews),  up(: serial,  mews,  mews) 
cmf$  layout  ub(mews,  mews) 
do  k  »l,p 

forall  (i*l:2*m-l:2,  j*l:n)  ub(i, j+(k-l)*n)*  u(k,l+((i-l)/2) ,  j) 
forall  (i*2:2*m:2,  j»l:n)  ub(i, j+(k-l)*n)«  up(k,l+((i-l)/2) , j) 
enddo 
return 
end 

subroutine  evaluate2  (a,u,a_original,m,n) 
integer  m,  n,  irank 
real  a(m,n) ,  u(m,m) ,  aboriginal (m,n) 
real  tl(m,n) 

cmf$  layout  a( mews,  mews) ,  u(mews,mews) 
cmf$  layout  a_original(  mews ,  mews) 

cmf$  layout  tl(  mews,  mews) 

C  evaluate  USVt 

C  (SVt)  is  already  in  matrix  a 

tl  =  matmul (transpose (u) ,  a)  -  a.original 

print*, ’error  =  max(abs(  U  S  Vt  -  A  ))  is  ’,  maxval(abs(tl) ) 
return 

end  subroutine  evaluate2 
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CM  FORTRAN  CODE  FOR  TEST  DRIVER 


program  svdtest 

C  test  program  for  singular  value  decomposition  (svd)  subroutine 

integer,  parameter::  mm=512,  nn=512 
integer  m  ,n  , irank,  isveep,  b(mm,mm),  ans,  i 
real  a(mm,  nn) ,  u(mm,nn) ,  v(mm,nn),  sv(mm) 
real  eps ,  error 

cmf$  layout  a(:news,  :news),  b(:news,  :nevs) ,  u( :news , inews) 

cmf$  layout  v(:news,  mews),  sv(mews) 

interface 

subroutine  svd  (ab,ub,vb,sv,m,n,irank, isweep, eps) 
integer  m,n,irank, isweep 

real  ab(2*m,n),  ub(2*m,n),  vb(2*m,n) ,  sv(2*m),eps 
cmf$  layout  ab(:news,  mews),  ub(:news,  mews),  vb(:news,  mews) 

cmf$  layout  sv(:news) 

end  interface 

call  CM_set_safety_mode(0) 
print*, ’eps  (default  to  2.22e-16)’ 
read*, eps 

if  (eps.le.O)  eps  =  2.22e-16 
print*, eps 
a  =  0.0 

print*, ’m,  n  of  matrix  ?’ 
read*,  m,n 
print* ,m,n 

print*, ’max  number  of  sweep  ?’ 

read*,isweep 

print* , isweep 

print*, ’creating  a  random  matrix’ 
call  cmf .random  (a(l:m,l:n)) 
print*, 'maxval  matrix  =’ ,maxval(a(l :m, 1 m) ) 
print*, ’call  svd  routine’ 

call  svd(a(l:m,l:n) ,u(l:m,l:n) ,  v(l:m,l:n),  sv(l:m), 

m/2,  n,  irank,  isweep,  eps) 
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print*,  ’exit  svd  routine’ 
stop 

end 
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